Community detection in complex networks using Extremal Optimization 

We propose a novel method to find the community structure in complex networks based on an
extremal optimization of the value of modularity. The method yields modularity values higher
than those found by the existing algorithms in the literature. We present the results of the
algorithm for computer-simulated and real networks and compare them with other approaches.
The efficiency and accuracy of the method make it feasible for the accurate identification of
community structure in large complex networks.

The description of the structure of complex networks has been one of the main
focuses of attention of the physics community in recent years. The levels of
description range from the microscopic (degree, clustering coefficient,
centrality measures, etc., of individual nodes) to the macroscopic description
in terms of statistical properties of the whole network (degree distribution,
total clustering coefficient, degree-degree correlations, etc.) [1, 2, 3, 4].
Between these two extremes there is a "mesoscopic" description of networks that
tries to explain their community structure. The general notion of community
structure in complex networks was first pointed out in the physics literature
by Girvan and Newman [5], and refers to the fact that nodes in many real
networks appear to group in subgraphs in which the density of internal
connections is larger than the density of connections with the rest of the
nodes in the network.

The community structure has been empirically found in many real technological,
biological and social networks, and its emergence seems to be at the heart of
the network formation process [11].

The existing methods intended to reveal the community structure in complex
networks have been recently reviewed in [10]. All these methods require a
definition of community that imposes the limit up to which a group should be
considered a community. However, the concept of community itself is
qualitative: nodes must be more connected within their community than with the
rest of the network, and its quantification is still a subject of debate.
Some quantitative definitions that came from sociology have been used in recent
studies [12], but in general the physics community has widely accepted a recent
measure for the community structure based on the concept of modularity Q
introduced by Newman and Girvan:

$$Q = \sum_r \left( e_{rr} - a_r^2 \right) \qquad (1)$$

where $e_{rr}$ is the fraction of links that connect two nodes inside the
community r, $a_r$ is the fraction of links that have one or both vertices
inside the community r, and the sum extends to all communities r in a given
network. Note that this measure provides a way to determine if a certain
mesoscopic description of the graph in terms of communities is more or less
accurate. The larger the value of Q, the more accurate the partition into
communities is.

The search for the optimal (largest) modularity value is an NP-hard problem due
to the fact that the space of possible partitions grows faster than any power
of the system size. For this reason, a heuristic search strategy is mandatory
to restrict the search space while preserving the optimization goal [14].
Indeed, it is possible to relate the current optimization problem for Q with
classical problems in statistical physics, e.g. the spin glass problem of
finding the ground state energy [15], where algorithms inspired by natural
optimization processes, such as simulated annealing and genetic algorithms,
have been successfully used.

In this Letter, we propose a new divisive algorithm that optimizes the
modularity Q using a heuristic search based on the Extremal Optimization (EO)
algorithm proposed by Boettcher and Percus [18, 19]. This algorithm is inspired
in turn by the Bak-Sneppen evolution model [20], and basically operates by
optimizing a global variable through the improvement of extremal local
variables that involve co-evolutionary avalanches. EO algorithms have been
shown to be more efficient than classical simulated annealing and genetic
algorithms while providing competitive accuracy [21].

In our case, the global variable to optimize is Q as defined in eq. (1). Thus,
the definition of the local variables used in the extremal optimization problem
should be related to the contribution of individual nodes i to the summation in
eq. (1), given a certain partition into communities:

$$q_i = \kappa_{r(i)} - k_i a_{r(i)} \qquad (2)$$

where $\kappa_{r(i)}$ is the number of links that a node i belonging to a
community r has with nodes in the same community, and $k_i$ is the degree of
node i. Note that $Q = \frac{1}{2L} \sum_i q_i$, where the sum runs over all
nodes i in the network, given a certain partition into communities, and L is
the total number of links in the network. Eq. (2) provides a measure that
depends on the node degree, and its normalization involves all the links in the
network after summation. Rescaling the local variable $q_i$ by the degree of
node i we obtain a proper definition for the contribution of node i to the
modularity, relative to its own degree and normalized in the interval [-1,1]:

$$\lambda_i = \frac{q_i}{k_i} = \frac{\kappa_{r(i)}}{k_i} - a_{r(i)} \qquad (3)$$

Keeping in mind this definition of $\lambda_i$, we can compare the relative
contribution of individual nodes to the community structure. We will consider
$\lambda_i$ as the local variable involved in the extremal optimization process
that characterizes an individual node; from now on we will refer to
$\lambda_i$ as the fitness of node i, using the common jargon in extremal
optimization problems.
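
As an illustration of these definitions, the following minimal Python sketch computes the modularity Q, the local variables $q_i$ and the fitness $\lambda_i$ for a given partition. The function name `node_fitness` and the edge-list input format are hypothetical choices made for this example. It also assumes that $a_r$ is counted as the fraction of link ends attached to community r, which is the reading under which $Q = \frac{1}{2L}\sum_i q_i$ holds exactly.

```python
from collections import defaultdict

def node_fitness(edges, community):
    """Return (Q, fitness) for an undirected simple graph.

    edges:     list of (u, v) pairs, each link listed once
    community: dict mapping node -> community label
    """
    L = len(edges)
    degree = defaultdict(int)     # k_i
    internal = defaultdict(int)   # kappa_{r(i)}: links of i inside its own community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[u] += 1
            internal[v] += 1

    # a_r, assumed here to be the fraction of link ends attached to community r
    a = defaultdict(float)
    for node, k in degree.items():
        a[community[node]] += k / (2 * L)

    # q_i = kappa_{r(i)} - k_i * a_{r(i)}, and lambda_i = q_i / k_i
    q = {i: internal[i] - degree[i] * a[community[i]] for i in degree}
    fitness = {i: q[i] / degree[i] for i in degree}
    Q = sum(q.values()) / (2 * L)
    return Q, fitness
```

For two triangles joined by a single link, with each triangle taken as one community, the node that carries the bridging link gets a lower fitness than the other two nodes of its triangle, as expected from eq. (3).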

The heuristic search we propose to find the optimal modularity value evolves as
follows:

• Initially, we split the nodes of the whole graph into two random partitions
having the same number of nodes each. This splitting creates an initial
division into communities, where communities are understood as connected
components in each partition.

• At each time step, the system self-organizes by moving the node with the
lowest fitness (extremal) from one partition to the other. In principle, each
movement implies the recalculation of the fitness of many nodes, because the
right-hand side of equation (3) involves the pseudo-global magnitude
$a_{r(i)}$.

• The process is repeated until an "optimal state" with a maximum value of Q is
reached. After that, we delete all the links between both partitions and
proceed recursively with every resultant connected component. The process
finishes when the modularity Q cannot be improved [?].

Note that this process is not a bipartitioning of the graph as known in
computer science [19], because the number of nodes in each partition depends on
the evolution process and is not restricted to be the same at the end of the
process; and, more importantly, each partition could contain several connected
components (communities) that, when the partitions are disconnected, result in
several subgraphs.
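
The search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `modularity_and_fitness` and `eo_bipartition` are hypothetical, the two partitions are treated as two communities when evaluating Q, all fitness values are recomputed from scratch at every step, the node is chosen with a rank-based probabilistic rule instead of always taking the worst one, and the recursive step (deleting the links between partitions and repeating on each connected component) is omitted. The graph needs at least two nodes and one link.

```python
import math
import random
from collections import defaultdict

def modularity_and_fitness(adj, side):
    """Q and fitness lambda_i when each side of the split is one community."""
    two_L = sum(len(nbrs) for nbrs in adj.values())
    a = defaultdict(float)                      # a_r as fraction of link ends
    for i, nbrs in adj.items():
        a[side[i]] += len(nbrs) / two_L
    fitness, Q = {}, 0.0
    for i, nbrs in adj.items():
        kappa = sum(1 for j in nbrs if side[j] == side[i])
        q_i = kappa - len(nbrs) * a[side[i]]
        Q += q_i / two_L
        fitness[i] = q_i / len(nbrs) if nbrs else 0.0
    return Q, fitness

def eo_bipartition(adj, alpha=1.0, seed=0):
    """One (non-recursive) extremal-optimization split of the graph `adj`.

    adj: dict node -> list of neighbours (undirected, symmetric).
    Stops when the best Q has not improved for alpha * N consecutive steps.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    rng.shuffle(nodes)
    # Random initial split into two halves of (almost) equal size.
    side = {v: (idx < n // 2) for idx, v in enumerate(nodes)}

    tau = 1.0 + 1.0 / math.log(n)               # rank-selection exponent
    weights = [r ** -tau for r in range(1, n + 1)]

    best_Q, fitness = modularity_and_fitness(adj, side)
    best_side = dict(side)
    stale = 0
    while stale < alpha * n:
        ranked = sorted(nodes, key=fitness.get)  # worst fitness first
        chosen = rng.choices(ranked, weights=weights)[0]
        side[chosen] = not side[chosen]          # move node to the other side
        Q, fitness = modularity_and_fitness(adj, side)
        if Q > best_Q + 1e-12:
            best_Q, best_side, stale = Q, dict(side), 0
        else:
            stale += 1
    return best_Q, best_side
```

Calling `eo_bipartition(adj)` returns the best modularity seen and the corresponding assignment of nodes to the two sides; a full implementation would then cut the links between the sides and recurse on each connected component.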

Let us illustrate the above-mentioned heuristics in a simple case. We will
apply it to the well-known Zachary karate club network [23]. Initially we split
the nodes into two random partitions (see Fig. 1, left). Note that the number
of initial communities (connected components in each partition) in this case is
five (see Fig. 1, right). After that, the self-organization process starts: the
node with the "worst fitness" is selected and moved from its partition to the
other partition; this movement provokes an avalanche of changes in the fitness
of the rest of the nodes. We calculate the new value for the modularity Q, and
again repeat the process until no changes can improve it (see Fig. 2).

FIG. 1: Left: Random initialization of the Zachary network into two partitions,
red and green. Right: Five different communities identified as connected
components in each partition. Each color defines a different community.

The application of the algorithm to the Zachary network provides the optimal
modularity value after three recursive iterations. The network is decomposed
into four communities and the value of the modularity is 0.419, greater than
the value 0.381 reported by Newman [14], the value 0.406 reported by Reichardt
et al. [?], and the value 0.412 reported by Donetti et al. [24], using
different optimization methods.

FIG. 2: Top: Network after edge removal at each recursive cut. Bottom:
Evolution of the Q value at each step of the adaptation process (horizontal
axis: algorithm steps). Separation bars indicate recursive divisions of the
graph performed at maximum Q.

The extremal optimization (EO) approach presented here has several technical
implementation details that are relevant for our purposes. In the original EO
algorithm, the node selected is always the node with the worst $\lambda_i$
value. This is a deterministic and fast way to solve the problem, but it
presents some drawbacks: the final result strongly depends on the
initialization and there is no possibility to escape from local maxima.
Instead, we use a probabilistic selection called $\tau$-EO, in which the nodes
are ranked according to their fitness values, and then the node of rank r is
selected according to the following probability distribution:

$$P(r) \propto r^{-\tau} \qquad (4)$$

This solution is less sensitive to different initializations and allows the
search to escape from local maxima. The exponent $\tau$ has been tuned around
the optimal values obtained for random networks of size N, which approach the
scaling $\tau \sim 1 + 1/\ln(N)$ [?]. The use of this technique also implies
the determination of the number of self-organization steps $\alpha N$ needed to
decide that the maximum value has little chance to be improved. In practice, we
keep track at each step of the last maximum value obtained for Q; if this
maximum is not improved in $\alpha N$ steps we stop the search. Usually
$\alpha$ is determined empirically, balancing accuracy and efficiency in the
algorithm; we use $\alpha = 1$, allowing as many steps as nodes to improve the
current maximum value of Q. The computational cost involved in the whole
process is $O(N^2 \ln^2 N)$, where a factor $N \ln N$ is the cost associated
with the ranking process; however, this factor can be substantially reduced,
down to $O(N)$, by using heap data structures [25] for the ranking selection
process. The total cost of the algorithm can then be reduced to
$O(N^2 \ln N)$.
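
The rank-based selection of eq. (4) can be sketched in Python as below. The function names are hypothetical, and the sketch sorts all nodes at every call for clarity; it does not implement the heap-based speed-up mentioned in the text.

```python
import math
import random

def tau_for_size(n):
    """Exponent suggested by the scaling tau ~ 1 + 1/ln(N); needs n >= 2."""
    return 1.0 + 1.0 / math.log(n)

def rank_probabilities(n, tau):
    """Normalized P(r) proportional to r**(-tau) for ranks r = 1..n."""
    weights = [r ** -tau for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def select_by_rank(fitness, tau, rng=random):
    """Pick one node; rank 1 is the node with the worst (lowest) fitness."""
    ranked = sorted(fitness, key=fitness.get)
    probs = rank_probabilities(len(ranked), tau)
    return rng.choices(ranked, weights=probs, k=1)[0]
```

Because P(r) decreases with r, the worst node is the most likely choice, but any node can be selected, which is what lets the search leave a local maximum. For a network of 34 nodes, `tau_for_size(34)` gives an exponent of roughly 1.28.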

To test the performance of the algorithm we first use computer-generated graphs
with a known community structure [5]. These graphs have 128 vertices grouped in
four communities of 32 vertices. Each vertex has on average $z_{in}$ edges to
vertices in the same community and $z_{out}$ edges to vertices in other
communities, keeping an average degree $z_{in} + z_{out} = 16$. We generate
several graphs with $z_{out}$ values up to 10, and compare the results of our
algorithm with those obtained using the heuristics proposed by Newman [?]. This
shows the capability of each algorithm to identify the communities when these
become fuzzier inside the whole network. Using the Girvan-Newman algorithm,
which is the reference algorithm for community identification, the communities
are well detected up to $z_{out} = 6$. In contrast, our algorithm detects the
communities up to $z_{out} = 8$, where the community structure still persists
but is much more difficult to reveal (see Fig. 3). In this particular case, 50
percent of the links of a node are within its community and 50 percent are
links with nodes outside the community. This result may seem contradictory, but
it is not. Note that the 50 percent of links with nodes outside the community
are equally distributed among the rest of the communities, so their
contribution is diluted by the number of other communities in the network, in
our case three. For this reason it is expected to find community structure even
in these cases.
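
A minimal Python sketch of such a benchmark graph is given below. The text only fixes the averages $z_{in}$ and $z_{out}$; this sketch additionally assumes that every pair of vertices is linked independently, with probability $z_{in}/31$ inside a community and $z_{out}/96$ between communities. The function name `benchmark_graph` and its parameters are hypothetical.

```python
import random

def benchmark_graph(z_out, n_groups=4, group_size=32, z_total=16, seed=0):
    """Random graph with planted communities and expected degree z_total.

    Returns (edges, group): a list of (u, v) pairs with u < v, and a list
    giving the planted community of each vertex.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    group = [v // group_size for v in range(n)]
    # Each vertex has group_size - 1 possible internal neighbours and
    # n - group_size possible external ones.
    p_in = (z_total - z_out) / (group_size - 1)
    p_out = z_out / (n - group_size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group[u] == group[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, group
```

With `z_out = 8`, a vertex has on average 8 links inside its own community but only 8/3, about 2.7, links to each of the three other communities, which is why the planted structure remains detectable in principle.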

For values of $z_{out}$ higher than 8, the average maximum modularity rapidly
approaches the limit Q = 0.208 (see inset of Fig. 3), the expected modularity
for a random network with the same number of links and nodes, as shown in [26].

We have also analyzed the community structure of several real networks: the
jazz musicians network [27], a university e-mail network [11], the C. elegans
metabolic network [28], a network of users of the PGP algorithm for secure
information transactions [29], and finally the relations between authors that
shared a paper in cond-mat.

FIG. 3: Fraction of nodes correctly classified using the computer-generated
graphs described in the text. Each point is an average over 100 different
networks. Inset: Average of the maximum modularity obtained in each case.

Network      Size     Q_N      #coms N   Q_EO     #coms EO
Zachary      34       0.3810   2         0.4188   4
Jazz         198      0.4379   4         0.4452   4
C. elegans   453      0.4001   10        0.4342   12
E-mail       1133     0.4796   13        0.5738   15
PGP          10680    0.7329   80        0.8459   365
Cond-Mat     27519    0.6683   302       0.6790   647

TABLE I: Maximum modularity obtained using the algorithm of [14] (Q_N) and the
extremal optimization algorithm (Q_EO) for different complex networks. The
number of communities found at the configuration with maximum modularity is
also included.

In Table I we present the results for the maximum modularity achieved by our
algorithm compared to the modularity obtained using [14]. The difference in
maximum modularity is up to 15% depending on the network considered. These
differences result in a better determination of the unknown community structure
of the whole network. The partitions into communities are clearly different, as
shown by the different numbers of communities found by the two algorithms.

Note that since the core of the algorithm is stochastic, different runs could
in principle yield different partitions. We have performed 100 runs of the
algorithm for the e-mail network and for a random network with the same number
of links and nodes to check the consistency of the proposed method. In Fig. 4
we present the fraction of times that a pair of nodes is classified in the same
partition. The community structure is clearly revealed for the e-mail network,
while for the random network this structure is nonexistent. Recently, Guimerà
and Amaral have obtained similar results by applying simulated annealing to
find the community structure in the context of metabolic networks [?].
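
The consistency check across runs can be sketched in Python as follows. The function name `coclassification` and the input format (one node-to-community dictionary per run) are hypothetical; the sketch only shows how the pairwise fractions are accumulated, not how the individual partitions are obtained.

```python
def coclassification(partitions):
    """Fraction of runs in which each pair of nodes shares a community.

    partitions: list of dicts (one per run) mapping node -> community label.
                All runs must cover the same set of nodes.
    Returns (nodes, matrix) where matrix[a][b] is the fraction of runs in
    which nodes[a] and nodes[b] were placed in the same community.
    """
    nodes = sorted(partitions[0])
    runs = len(partitions)
    counts = [[0] * len(nodes) for _ in nodes]
    for part in partitions:
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                if part[i] == part[j]:
                    counts[a][b] += 1
    matrix = [[c / runs for c in row] for row in counts]
    return nodes, matrix
```

Entries close to 1 or 0 indicate pairs that are consistently grouped together or apart, while a matrix without such blocks corresponds to the absence of a stable community structure.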

Summarizing, we have presented an algorithm based on extremal optimization that
optimizes the modularity and allows an accurate identification of community
structure in complex networks. The modularity values obtained exceed those of
the previous algorithms reported in the literature for the networks studied.

FIG. 4: Fraction of nodes classified in the same partition over 100
realizations of the algorithm. The color of the position (i,j) corresponds to
the fraction of times that nodes i and j belong to the same partition.